Abstract
Background and Aims Myelodysplastic syndromes (MDS) present significant diagnostic and therapeutic challenges, often requiring input from multidisciplinary teams and subspecialty-trained leukemia and pathology experts. The complexity is compounded by evolving classification systems (e.g., WHO and ICC) and an expanding therapeutic landscape, which complicate real-time clinical decision-making. Although large language models (LLMs) such as ChatGPT have demonstrated promise in medical domains, they frequently yield inaccurate or overly generalized responses when applied to complex hematologic scenarios. Even state-of-the-art models—capable of human-like reasoning and sometimes outperforming clinicians in general tasks—have not been systematically evaluated in the context of advanced hematologic disorders such as MDS.
To address this gap, we first assessed the performance of leading LLMs, including ChatGPT, Claude, and DeepSeek, on challenging, real-world MDS cases. After identifying their limitations, we built the Virtual MDS Panel (VMP), a coordinated AI system in which AI agents—task-bound software assistants that understand natural language and help users complete tasks, answer questions, and make decisions efficiently—are trained on domain knowledge (WHO/ICC; IPSS-R/IPSS-M; NCCN) and explicit decision rules to collaborate and produce tumor-board–level recommendations.
Methods VMP comprises four specialized AI agents: a moderator agent that receives clinical queries; a pathology agent trained on WHO/ICC criteria; a prognostication agent using risk models (IPSS, IPSS-R, IPSS-M); and a therapy agent grounded in NCCN and ELN guidelines. For each case, a physician submits a clinical scenario to the moderator, which breaks down the query and delegates sub-tasks to the relevant agents. The moderator then synthesizes their responses into a structured output.
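The abstract does not include implementation details; the following is a minimal Python sketch of the delegation pattern described above, in which a moderator decomposes a query, routes sub-tasks to guideline-grounded specialist agents, and assembles a structured answer. All names (SpecialistAgent, Moderator, call_llm) and the routing logic are hypothetical illustrations under stated assumptions, not the authors' actual code.

```python
# Minimal sketch of the moderator/specialist delegation pattern described in
# Methods. All class and function names are hypothetical; the real VMP
# presumably wraps guideline-grounded LLM calls behind each specialist.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a grounded LLM call (e.g., with WHO/ICC, IPSS-M,
    or NCCN context injected into the system prompt)."""
    return f"[{system_prompt[:30]}...] response to: {user_prompt[:60]}"

class SpecialistAgent:
    def __init__(self, name: str, knowledge_prompt: str):
        self.name = name
        self.knowledge_prompt = knowledge_prompt

    def answer(self, sub_task: str) -> str:
        return call_llm(self.knowledge_prompt, sub_task)

class Moderator:
    def __init__(self, specialists: dict[str, SpecialistAgent]):
        self.specialists = specialists

    def decompose(self, case: str) -> dict[str, str]:
        # In practice the moderator would itself use an LLM to split the
        # query; in this toy version each specialist receives the full case.
        return {name: case for name in self.specialists}

    def run(self, case: str) -> str:
        sub_tasks = self.decompose(case)
        sections = [
            f"## {name}\n{agent.answer(sub_tasks[name])}"
            for name, agent in self.specialists.items()
        ]
        return "\n\n".join(sections)  # structured, tumor-board-style output

panel = Moderator({
    "Pathology (WHO/ICC)": SpecialistAgent("pathology", "Apply WHO 2022 and ICC criteria."),
    "Prognosis (IPSS/IPSS-R/IPSS-M)": SpecialistAgent("prognosis", "Apply IPSS-family risk models."),
    "Therapy (NCCN/ELN)": SpecialistAgent("therapy", "Follow NCCN and ELN guidance."),
})
print(panel.run("72-year-old with MDS, SF3B1 mutation, ring sideroblasts, Hb 8.2 g/dL"))
```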
To evaluate performance, we created a test set of 30 complex, real-world MDS cases. VMP responses were compared to those from leading LLMs (ChatGPT-4o, GPT-o3, Claude, DeepSeek). Eleven international MDS experts, blinded to response sources, independently scored outputs for accuracy, clinical relevance, and completeness (Likert scale 1–5). They also assessed diagnostic reasoning, prognostic validity, and treatment recommendations, while classifying factual errors as none, minor, or major. To evaluate the consistency of expert ratings, we used the intraclass correlation coefficient (ICC) to measure how well experts agreed on numerical scores and Cohen's κ (kappa) to assess their agreement when identifying errors.
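For the inter-rater analyses, a brief sketch of how the reported agreement statistics could be computed with standard Python libraries is shown below (pingouin for the ICC on numerical scores, scikit-learn for pairwise Cohen's κ on error labels). The data layout, column names, and toy values are assumptions for illustration, not the study's actual analysis code.

```python
# Sketch of the agreement statistics named in Methods: ICC for numerical
# scores and Cohen's kappa for categorical error labels. Column names and
# the toy data are illustrative assumptions.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Long-format ratings: one row per (case, expert) accuracy score on a 1-5 scale.
scores = pd.DataFrame({
    "case":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "expert": ["A", "B", "C"] * 3,
    "score":  [4, 5, 4, 3, 3, 4, 5, 5, 5],
})
icc = pg.intraclass_corr(data=scores, targets="case",
                         raters="expert", ratings="score")
print(icc[["Type", "ICC"]])  # report the ICC variant appropriate to the design

# Error classification (none / minor / major) by two experts on the same
# responses. Cohen's kappa is defined for rater pairs, so agreement among
# many raters would be summarized over pairs (or with Fleiss' kappa).
expert_a = ["none", "minor", "major", "none", "minor"]
expert_b = ["none", "minor", "minor", "none", "minor"]
print(cohen_kappa_score(expert_a, expert_b))
```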
Results VMP achieved an overall expert-rated accuracy of 93%, outperforming GPT-o3 (82%), GPT-4o (80%), DeepSeek (71%), and Claude (66%). Across domains, VMP received the highest mean expert rating (4.2 overall on the 1-5 scale; diagnosis 4.3, prognosis 4.4, therapy selection 3.9), compared with GPT-o3 (3.6 overall; 3.7 / 3.6 / 3.6), GPT-4o (3.2 overall; 3.1 / 3.2 / 3.4), DeepSeek (3.0 overall; 2.9 / 3.0 / 3.1), and Claude (2.9 overall; 2.7 / 2.9 / 3.1), respectively.
Experts identified major factual errors in only 9% of VMP responses, compared with 26% for GPT-o3, 26% for GPT-4o, 33% for DeepSeek, and 36% for Claude. Minor factual errors occurred in 36% of VMP responses versus 47-52% for the other four models.
Experts showed strong agreement in their evaluations, with high consistency in scoring (ICC = 0.81) and in identifying AI errors or hallucinations (κ = 0.76), confirming the reliability of the review process.
Conclusions We developed an advanced, MDS-focused AI system that improved accuracy and alignment with expert practice, outperforming current state-of-the-art general AI models. By emulating a virtual tumor board, the system offers structured, evidence-based guidance that can aid hematologists in optimizing precision care.